import pandas as pd
import warnings
warnings.filterwarnings("ignore")
names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status',
'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss',
'hours-per-week', 'native-country', 'income']
train_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
train_df = pd.read_csv(train_url, names=names, na_values='?')
train_df = train_df.dropna()
test_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test'
test_df = pd.read_csv(test_url, names=names, na_values='?', skiprows=1)
test_df = test_df.dropna()
feature_names_before_process = list(train_df.columns)
It is necessary to first understand the dataset and preprocess the data. From the code above, we can see that this dataset has 14 features: 'age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', and one target: 'income'.
To make the training process more efficient and reduce noise, we can merge features with similar meanings, e.g. 'education' and 'education-num', or 'marital-status' and 'relationship', and treat non-numeric categories with extremely small prevalence as 'Other'.
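For instance, whether 'education' and 'education-num' are truly redundant can be checked by confirming a one-to-one mapping between them. A minimal sketch on hypothetical rows (the real check would run on train_df):

```python
import pandas as pd

# Hypothetical sample rows; on the real data this check would use train_df
df = pd.DataFrame({
    "education":     ["Bachelors", "HS-grad", "Bachelors", "Masters"],
    "education-num": [13, 9, 13, 14],
})
# If every education level maps to exactly one code, the two columns are
# redundant and one of them can be dropped
codes_per_level = df.groupby("education")["education-num"].nunique()
print((codes_per_level == 1).all())  # → True for a 1:1 mapping
```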
train_df = train_df.replace(' ?', ' Other')
test_df = test_df.replace(' ?', ' Other')
import matplotlib.pyplot as plt
fig, axs = plt.subplots(nrows=2, ncols=4, figsize=(20, 10))
def plot_data_distribution(feature, ax):
    counts = train_df[feature].value_counts()
    # Calculate the proportions
    proportions = counts / counts.sum()
    # Plot the pie chart
    ax.pie(counts, startangle=90, autopct='%.1f%%')
    ax.legend(proportions.index, loc=2, fontsize=8)
    ax.set_title(feature, size=20)
features_non_figure = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']
for i, feature in enumerate(features_non_figure):
    row = i // 4
    col = i % 4
    plot_data_distribution(feature, axs[row, col])
plt.show()
We can learn from the plots above that some features have a small representation in the dataset, accounting for less than 5% of the total values. To simplify the data and avoid overfitting on these rare values, we group them together and label them as "Other".
# Define a function to replace rare values with "Other"
def replace_rare_values(feature, threshold):
    counts = train_df[feature].value_counts()
    rare_values = counts[counts / counts.sum() < threshold].index
    train_df.loc[train_df[feature].isin(rare_values), feature] = " Other"
    test_df.loc[test_df[feature].isin(rare_values), feature] = " Other"

# Call the function for each feature with a threshold of 5%
for feature in ['workclass', 'occupation', 'race', 'native-country', 'marital-status']:
    replace_rare_values(feature, 0.05)
#Target variable
y_train = [1 if y == ' >50K' else 0 for y in train_df['income']]
#Model features
X_train = train_df[['age', 'workclass', 'fnlwgt', 'education-num', 'marital-status', 'occupation', 'race', 'sex', 'capital-gain', 'capital-loss',
'hours-per-week','native-country']].copy()
#Target variable
y_test = [1 if y == ' >50K.' else 0 for y in test_df['income']]
#Model features
X_test = test_df[['age', 'workclass', 'fnlwgt', 'education-num', 'marital-status', 'occupation', 'race', 'sex', 'capital-gain', 'capital-loss',
'hours-per-week','native-country']].copy()
There are two main ways to encode categorical data for machine learning: one-hot encoding and label encoding. Although a preliminary accuracy test showed little difference between the two methods, label encoding can be problematic for features like occupation and native country, as it imposes an ordinal ranking that has no real meaning. Therefore, this coursework uses one-hot encoding to minimize bias in data preprocessing as much as possible.
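The difference between the two encodings can be illustrated on a toy occupation column; the values below are made up. Label encoding assigns arbitrary integers, implicitly ranking the categories, while one-hot encoding creates one binary column per category:

```python
import pandas as pd

# A toy occupation column; values are made up for illustration
occ = pd.Series(["Sales", "Craft-repair", "Sales", "Exec-managerial"])
# Label encoding: arbitrary integers in order of first appearance,
# which implicitly ranks the categories
labels, uniques = pd.factorize(occ)
print(labels.tolist())  # → [0, 1, 0, 2]
# One-hot encoding: one binary column per category, no implied order
print(pd.get_dummies(occ).columns.tolist())
```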
# Get dummy variables for categorical features in training set
X_train = pd.get_dummies(X_train)
# Get dummy variables for categorical features in test set
X_test = pd.get_dummies(X_test)
# Align test columns to the training columns: get_dummies can produce
# different column sets if a category appears in only one split
X_train, X_test = X_train.align(X_test, join='left', axis=1, fill_value=0)
# find the optimized hyperparameters of the large decision tree
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
# the search ranges for the large decision tree's hyperparameters
param_grid_large = {'max_depth': range(6, 21, 2),
                    'min_samples_split': range(2, 7, 2),
                    'min_samples_leaf': range(2, 11, 2)}
clf_large_dt = DecisionTreeClassifier(random_state=42)
grid_search_large_dt = GridSearchCV(clf_large_dt, param_grid_large, cv=5, scoring='accuracy')
grid_search_large_dt.fit(X_train, y_train)
# print the optimized hyperparameters
print('Best hyperparameters: {}'.format(grid_search_large_dt.best_params_))
print('Best accuracy: {:.3f}'.format(grid_search_large_dt.best_score_))
Best hyperparameters: {'max_depth': 10, 'min_samples_leaf': 10, 'min_samples_split': 2}
Best accuracy: 0.857
import os
os.environ["PATH"] += os.pathsep + 'F:/Graphviz/bin'
from sklearn.tree import export_graphviz
from sklearn.metrics import accuracy_score
import graphviz
# Build the large decision tree model
clf_large_dt = DecisionTreeClassifier(max_depth=grid_search_large_dt.best_params_['max_depth'],
min_samples_split=grid_search_large_dt.best_params_['min_samples_split'],
min_samples_leaf = grid_search_large_dt.best_params_['min_samples_leaf'],
random_state=42)
clf_large_dt.fit(X_train, y_train)
y_pred_large_dt = clf_large_dt.predict(X_test)
acc = accuracy_score(y_test, y_pred_large_dt)
print('Accuracy:', acc)
# plot decision tree with Graphviz
export_graphviz(clf_large_dt, out_file='./decision_tree_large.dot',
feature_names=X_train.columns.tolist(),
class_names=['less_than_fifty_thousands', 'greater_than_fifty_thousands'],
filled=True, rounded=True,
special_characters=True)
# save the png
! dot -Tpng ./decision_tree_large.dot -o ./decision_tree_large.png
# Display png
from IPython.display import Image
Image(filename='decision_tree_large.png')
Accuracy: 0.8628462625145875
dot: graph is too large for cairo-renderer bitmaps. Scaling by 0.610607 to fit
# find the optimized hyperparameters of the small decision tree
# the search ranges for the small decision tree's hyperparameters
param_grid_small = {'max_depth': range(1, 4),
                    'min_samples_split': range(2, 7, 2),
                    'min_samples_leaf': range(1, 6)}
clf_small_dt = DecisionTreeClassifier(random_state=42)
grid_search_small_dt = GridSearchCV(clf_small_dt, param_grid_small, cv=5, scoring='accuracy')
grid_search_small_dt.fit(X_train, y_train)
# print the optimized hyperparameters
print('Best hyperparameters: {}'.format(grid_search_small_dt.best_params_))
print('Best accuracy: {:.3f}'.format(grid_search_small_dt.best_score_))
Best hyperparameters: {'max_depth': 3, 'min_samples_leaf': 1, 'min_samples_split': 2}
Best accuracy: 0.844
# Build the small decision tree model
clf_small_dt = DecisionTreeClassifier(max_depth=grid_search_small_dt.best_params_['max_depth'],
min_samples_split=grid_search_small_dt.best_params_['min_samples_split'],
min_samples_leaf = grid_search_small_dt.best_params_['min_samples_leaf'],
random_state=42)
clf_small_dt.fit(X_train, y_train)
y_pred_small_dt = clf_small_dt.predict(X_test)
acc = accuracy_score(y_test, y_pred_small_dt)
print('Accuracy:', acc)
# export decision tree to Graphviz
export_graphviz(clf_small_dt, out_file='./decision_tree_small.dot',
feature_names=X_train.columns.tolist(),
class_names=['less_than_fifty_thousands', 'greater_than_fifty_thousands'],
filled=True, rounded=True,
special_characters=True)
# save the png
! dot -Tpng ./decision_tree_small.dot -o ./decision_tree_small.png
# Display png
from IPython.display import Image
Image(filename='decision_tree_small.png')
Accuracy: 0.8447884036607088
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
# Define the classifier
clf_xgb = xgb.XGBClassifier(
    objective='binary:logistic',
    random_state=42
)
# hyperparameter ranges for optimization
params_xgb = {
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': range(3, 11),
}
# find optimized hyperparameters with grid search
grid_search = GridSearchCV(
estimator=clf_xgb,
param_grid=params_xgb,
cv=5,
scoring='roc_auc'
)
grid_search.fit(X_train, y_train)
# print the best hyperparameters
print("Best parameters found: ", grid_search.best_params_)
print('Best ROC AUC: {:.3f}'.format(grid_search.best_score_))
# train the model
clf_xgb = xgb.XGBClassifier(**grid_search.best_params_, random_state=42)
clf_xgb.fit(X_train, y_train)
# predict on the test set
y_pred_xgb = clf_xgb.predict(X_test)
Best parameters found: {'learning_rate': 0.1, 'max_depth': 6}
Best ROC AUC: 0.926
One way to compare the performance of the models is to examine the area under the ROC curve (AUC); a larger area indicates a better model.
from sklearn.metrics import roc_curve, auc
fig, ax = plt.subplots()
models = {
'clf_small_dt': clf_small_dt,
'clf_large_dt': clf_large_dt,
'xgb_clf': clf_xgb
}
colors = {
'clf_small_dt': 'r',
'clf_large_dt': 'g',
'xgb_clf': 'b'
}
# calculate AUC value
for model_name, model in models.items():
    y_score = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_score)
    roc_auc = auc(fpr, tpr)
    ax.plot(fpr, tpr, color=colors[model_name], lw=2, alpha=0.8, label='{} (AUC = {:.2f})'.format(model_name, roc_auc))
# reference line
ax.plot([0, 1], [0, 1], linestyle='--', lw=1, color='gray', alpha=.8)
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.set_title('Receiver Operating Characteristic')
ax.legend(loc="lower right")
plt.show()
Another way is to list all the performance scores of the models.
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
# compute performance
large_dt_acc = accuracy_score(y_test, y_pred_large_dt)
large_dt_f1 = f1_score(y_test, y_pred_large_dt)
large_dt_prec = precision_score(y_test, y_pred_large_dt)
large_dt_rec = recall_score(y_test, y_pred_large_dt)
small_dt_acc = accuracy_score(y_test, y_pred_small_dt)
small_dt_f1 = f1_score(y_test, y_pred_small_dt)
small_dt_prec = precision_score(y_test, y_pred_small_dt)
small_dt_rec = recall_score(y_test, y_pred_small_dt)
xgb_acc = accuracy_score(y_test, y_pred_xgb)
xgb_f1 = f1_score(y_test, y_pred_xgb)
xgb_prec = precision_score(y_test, y_pred_xgb)
xgb_rec = recall_score(y_test, y_pred_xgb)
# print in a table
performance_comparison = {'Model': ['y_pred_large_dt', 'y_pred_small_dt', 'y_pred_xgb'],
'Accuracy': [large_dt_acc, small_dt_acc, xgb_acc],
'F1-Score': [large_dt_f1, small_dt_f1, xgb_f1],
'Precision': [large_dt_prec, small_dt_prec, xgb_prec],
'Recall': [large_dt_rec, small_dt_rec, xgb_rec]}
df_performance_comparison = pd.DataFrame(performance_comparison)
print(df_performance_comparison)
             Model  Accuracy  F1-Score  Precision    Recall
0  y_pred_large_dt  0.862846  0.668645   0.778776  0.585803
1  y_pred_small_dt  0.844788  0.606815   0.755521  0.507020
2       y_pred_xgb  0.874393  0.704864   0.792086  0.634945
From the table, it can be observed that the XGBoost model (y_pred_xgb) performs the best among the three models for all the metrics, followed by the large decision tree model (y_pred_large_dt), and the small decision tree model (y_pred_small_dt) performs the worst. This suggests that the XGBoost model is the best choice for this particular task. However, evaluation and testing on different tasks and datasets are necessary to determine which model is the best choice.
SHAP is used for global interpretation of models. In a SHAP summary plot, the features that are closer to the top have a greater influence on the model. The plot is composed of individual points.
import shap
shap.initjs()
# create SHAP values
explainer = shap.TreeExplainer(clf_xgb)
shap_values = explainer.shap_values(X_test)
# create summary plot
shap.summary_plot(shap_values, X_test, feature_names=X_train.columns)
It can be observed from the SHAP summary plot above that the top 5 influential features are, from high to low, 'marital-status_ Married-civ-spouse', 'age', 'education-num', 'capital-gain', and 'hours-per-week'.
shap.force_plot(explainer.expected_value, shap_values[:1000,:], X_test.iloc[:1000,:])
With the force plot above, we can see the prediction for any individual point by hovering over it.
SHAP is also used in this section for local interpretation of the XGBoost model.
# plot the SHAP values for a single prediction
shap.force_plot(explainer.expected_value, shap_values[0], X_test.iloc[0,:], feature_names=X_train.columns)
It can be observed from the SHAP plot above that the first sample in the dataset (real income: '<=50K') is predicted to have income below 50K, with a SHAP value of -5.84. The only positive feature is 'sex_ Female'. The negative features, from high to low, are: not being 'marital-status_ Married-civ-spouse', age 25, education-num 7, occupation 'Machine-op-inspct', marital-status 'Never-married', hours-per-week 40, not being 'race_ White', and capital-gain 0.
Image(filename='decision_tree_large.png')
Although a decision tree is interpretable, a large one is difficult to read. Hence, we can use feature_importances_ from scikit-learn to check the exact weights of these features.
import numpy as np
importances = clf_large_dt.feature_importances_
indices = np.argsort(importances)[::-1][:10]
print("Feature ranking:")
for f in range(10):
    print("%d. %s (%f)" % (f + 1, X_train.columns[indices[f]], importances[indices[f]]))
plt.figure(figsize=(8,6))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [X_train.columns[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
Feature ranking:
1. marital-status_ Married-civ-spouse (0.394698)
2. education-num (0.214734)
3. capital-gain (0.193278)
4. capital-loss (0.062832)
5. age (0.048954)
6. hours-per-week (0.034893)
7. occupation_ Exec-managerial (0.011682)
8. fnlwgt (0.010176)
9. workclass_ Self-emp-not-inc (0.007098)
10. sex_ Female (0.006227)
Given that the large decision tree has a max depth of 10, the 10 most influential features are, from high to low: 'marital-status_ Married-civ-spouse', 'education-num', 'capital-gain', 'capital-loss', 'age', 'hours-per-week', 'occupation_ Exec-managerial', 'fnlwgt', 'workclass_ Self-emp-not-inc', and 'sex_ Female'.
LIME is used in this section for local interpretation.
import lime
import lime.lime_tabular
# reset index
X_train = X_train.reset_index(drop=True)
# generate explainer
explainer = lime.lime_tabular.LimeTabularExplainer(X_train.values, feature_names=X_train.columns,
class_names=[0,1], discretize_continuous=True)
# explain the point
i = np.random.RandomState(42).randint(0, X_test.shape[0])
exp = explainer.explain_instance(X_test.iloc[i].values, clf_large_dt.predict_proba, num_features=5)
# show the plot
exp.show_in_notebook(show_table=True, show_all = False)
From the LIME output above, it can be observed that this test sample is predicted as '<=50K' with 53% probability and '>50K' with 47%. In detail, this split is composed of negative 70% for 'capital-gain' <= 0, positive 18% for 'marital-status_ Married-civ-spouse', negative 10% for 'capital-loss' <= 0, negative 7% for 'hours-per-week' <= 40, and negative 4% for not being 'occupation_ Exec-managerial'.
A small decision tree can typically be treated as a globally interpretable model, because we can walk through its few levels directly to understand the relative importance of features in the model.
Image(filename='decision_tree_small.png')
By inspecting the top levels of the decision tree shown in this image, we can identify the three most influential features, which are marital-status_ Married-civ-spouse, capital-gain, and education-num.
We can also use feature_importances_ from scikit-learn to check the exact weights of these features.
import numpy as np
importances = clf_small_dt.feature_importances_
indices = np.argsort(importances)[::-1][:3]
print("Feature ranking:")
for f in range(3):
    print("%d. %s (%f)" % (f + 1, X_train.columns[indices[f]], importances[indices[f]]))
plt.figure(figsize=(8,6))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [X_train.columns[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
Feature ranking:
1. marital-status_ Married-civ-spouse (0.524028)
2. capital-gain (0.245267)
3. education-num (0.229026)
These three features, marital-status_ Married-civ-spouse (0.524028), capital-gain (0.245267), and education-num (0.229026), are the same as in the large decision tree, with a different ranking.
Given that the small decision tree has a max depth of only 3, we can easily go through its leaves step by step to interpret the model with these three features.
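One way to walk through a depth-3 tree step by step is to print its rules with scikit-learn's export_text. A self-contained sketch on synthetic data; for the actual model the call would be export_text(clf_small_dt, feature_names=list(X_train.columns)):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(42)
# Synthetic stand-ins for two influential features; names are illustrative
X = rng.random((200, 2))
y = ((X[:, 0] > 0.5) & (X[:, 1] > 0.3)).astype(int)
clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)
# Print the if/else rules of the fitted tree, one line per node
print(export_text(clf, feature_names=["capital-gain", "education-num"]))
```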
LIME is used in this section for local interpretation.
import lime
import lime.lime_tabular
# reset index
X_train = X_train.reset_index(drop=True)
# generate explainer
explainer = lime.lime_tabular.LimeTabularExplainer(X_train.values, feature_names=X_train.columns,
class_names=[0,1], discretize_continuous=True)
# explain the point
i = np.random.RandomState(42).randint(0, X_test.shape[0])
exp = explainer.explain_instance(X_test.iloc[i].values, clf_small_dt.predict_proba, num_features=5)
# show the plot
exp.show_in_notebook(show_table=True, show_all = False)
It can be observed from LIME that this combination of data is predicted with a 70% chance of being classified as "<50K" income. This percentage is composed of 67% negative for capital-gain<=0, 30% positive for Married-civ-spouse, 5% negative for education-num > 9, 2% positive for marital-status not in 'other', and 1% negative for marital-status not in 'Never-married'.
Several conclusions can be drawn from the interpretations above:
One-hot encoded features are more prone to bias than continuous features (e.g. age), because the model may infer spurious connections between the dummy columns. Therefore, this part mainly focuses on the one-hot encoded features: races, sexes, marital statuses, occupations, and countries.
races = ['Black', 'White', 'Other']
sexes = ['Female', 'Male']
marital_statuses = ['Divorced', 'Married-civ-spouse', 'Never-married', 'Other']
occupations = ['Adm-clerical', 'Craft-repair', 'Exec-managerial', 'Machine-op-inspct',
'Other', 'Other-service', 'Prof-specialty', 'Sales']
countries = ['Other', 'United-States']
# define a function to print the misclassification rate between two combinations
from sklearn.metrics import confusion_matrix
# define a function to plot the misclassification rate between two combinations
def plot_combo_error_rate(model_pred, feat1, feat2, feat1_prefix, feat2_prefix):
    misclassification_rates = []
    for feature_1 in feat1:
        for feature_2 in feat2:
            idx = (X_test[feat1_prefix + feature_1] == 1) & (X_test[feat2_prefix + feature_2] == 1)
            y_test_subset = np.array(y_test)[idx]
            y_pred_subset = np.array(model_pred)[idx]
            labels = clf_small_dt.classes_
            cm = confusion_matrix(y_test_subset, y_pred_subset, labels=labels)
            FP = cm[0, 1]
            FN = cm[1, 0]
            if np.sum(cm) == 0:
                misclassification_rate = np.nan
            else:
                misclassification_rate = (FP + FN) / np.sum(cm)
            misclassification_rates.append((feature_1, feature_2, misclassification_rate))
    misclassification_df = pd.DataFrame(misclassification_rates, columns=['feature1', 'feature2', 'Misclassification Rate'])
    fig, ax = plt.subplots(figsize=(16, 2))
    misclassification_df.groupby(['feature1', 'feature2']).mean()['Misclassification Rate'].unstack().plot.bar(ax=ax)
    ax.set_title('Misclassification Rates by ' + feat1_prefix[:-2] + ' and ' + feat2_prefix[:-2])
    ax.set_xlabel(feat1_prefix[:-2])
    ax.set_ylabel('Misclassification Rate')
    plt.xticks(rotation=0)
    plt.show()
def print_false_rates_table(model_pred, feat1, feat2, feat1_prefix, feat2_prefix):
    false_positive_rates = []
    false_negative_rates = []
    for feature_1 in feat1:
        for feature_2 in feat2:
            idx = (X_test[feat1_prefix + feature_1] == 1) & (X_test[feat2_prefix + feature_2] == 1)
            y_test_subset = np.array(y_test)[idx]
            y_pred_subset = np.array(model_pred)[idx]
            labels = clf_small_dt.classes_
            cm = confusion_matrix(y_test_subset, y_pred_subset, labels=labels)
            TN = cm[0, 0]
            FP = cm[0, 1]
            FN = cm[1, 0]
            TP = cm[1, 1]
            if (FP + TN) == 0:
                false_positive_rate = np.nan
            else:
                false_positive_rate = FP / (FP + TN)
            if (FN + TP) == 0:
                false_negative_rate = np.nan
            else:
                false_negative_rate = FN / (FN + TP)
            false_positive_rates.append((feature_1, feature_2, false_positive_rate))
            false_negative_rates.append((feature_1, feature_2, false_negative_rate))
    # transfer to dataframes
    false_positive_df = pd.DataFrame(false_positive_rates, columns=['feature1', 'feature2', 'False Positive Rate'])
    false_negative_df = pd.DataFrame(false_negative_rates, columns=['feature1', 'feature2', 'False Negative Rate'])
    # merge the dataframes
    rates_df = pd.merge(false_positive_df, false_negative_df, on=['feature1', 'feature2'])
    # print as a table
    print('{:<20s}{:<20s}{:<20s}{:<20s}'.format(feat1_prefix[:-2], feat2_prefix[:-2], 'False Positive Rate', 'False Negative Rate'))
    for index, row in rates_df.iterrows():
        print('{:<20s}{:<20s}{:<20.6f}{:<20.6f}'.format(row['feature1'], row['feature2'], row['False Positive Rate'], row['False Negative Rate']))
The functions above plot a diagram or print a table of the misclassification rate for any combination of two features.
plot_combo_error_rate(y_pred_xgb,races, sexes, 'race_ ', 'sex_ ')
plot_combo_error_rate(y_pred_xgb,marital_statuses, sexes, 'marital-status_ ', 'sex_ ')
plot_combo_error_rate(y_pred_xgb,occupations, sexes, 'occupation_ ', 'sex_ ')
plot_combo_error_rate(y_pred_xgb,countries, sexes, 'native-country_ ', 'sex_ ')
From the combinations above, it can be clearly observed that predictions for males are more biased than for females, as males show a higher misclassification rate. This can have several causes. After observing the initial data separately, the reasons can be:
plot_combo_error_rate(y_pred_xgb,sexes, races, 'sex_ ', 'race_ ')
plot_combo_error_rate(y_pred_xgb,marital_statuses, races, 'marital-status_ ', 'race_ ')
plot_combo_error_rate(y_pred_xgb,occupations, races, 'occupation_ ', 'race_ ')
plot_combo_error_rate(y_pred_xgb,countries, races, 'native-country_ ', 'race_ ')
From the combinations above, it can be observed that 'race_Other' has the highest misclassification rate, while 'race_White' has a higher misclassification rate than 'race_Black' in most cases. After observing the initial data separately, the reasons could be:
plot_combo_error_rate(y_pred_xgb,sexes, marital_statuses, 'sex_ ', 'marital-status_ ')
plot_combo_error_rate(y_pred_xgb,races, marital_statuses, 'race_ ', 'marital-status_ ')
plot_combo_error_rate(y_pred_xgb,occupations, marital_statuses, 'occupation_ ', 'marital-status_ ')
plot_combo_error_rate(y_pred_xgb,countries, marital_statuses, 'native-country_ ', 'marital-status_ ')
After examining the misclassification rates of different marital statuses, it is noticeable that the category 'Married-civ-spouse' has a significantly higher misclassification rate than the other categories. This feature is also the most influential in the three models. However, there is no specific combination or data imbalance that can account for this misclassification. Therefore, it is necessary to further examine the false positive rate and false negative rate of the 'Married-civ-spouse' category to better understand the bias.
print_false_rates_table(y_pred_xgb,sexes, marital_statuses, 'sex_ ', 'marital-status_ ')
sex                 marital-status      False Positive Rate False Negative Rate
Female              Divorced            0.003241            0.684211
Female              Married-civ-spouse  0.135193            0.260504
Female              Never-married       0.001272            0.688172
Female              Other               0.003881            0.688889
Male                Divorced            0.013369            0.460177
Male                Married-civ-spouse  0.149246            0.335264
Male                Never-married       0.003176            0.570470
Male                Other               0.013263            0.508475
print_false_rates_table(y_pred_xgb,races, marital_statuses, 'race_ ', 'marital-status_ ')
race                marital-status      False Positive Rate False Negative Rate
Black               Divorced            0.004785            0.533333
Black               Married-civ-spouse  0.103203            0.468966
Black               Never-married       0.000000            0.600000
Black               Other               0.000000            0.555556
White               Divorced            0.007652            0.563830
White               Married-civ-spouse  0.149986            0.318652
White               Never-married       0.002828            0.614679
White               Other               0.006920            0.586207
Other               Divorced            0.000000            0.600000
Other               Married-civ-spouse  0.169399            0.360000
Other               Never-married       0.000000            0.642857
Other               Other               0.029851            0.625000
print_false_rates_table(y_pred_xgb,occupations, marital_statuses, 'occupation_ ', 'marital-status_ ')
occupation          marital-status      False Positive Rate False Negative Rate
Adm-clerical        Divorced            0.002755            0.600000
Adm-clerical        Married-civ-spouse  0.144876            0.469027
Adm-clerical        Never-married       0.000000            0.866667
Adm-clerical        Other               0.000000            0.800000
Craft-repair        Divorced            0.005000            0.600000
Craft-repair        Married-civ-spouse  0.049102            0.661098
Craft-repair        Never-married       0.000000            0.700000
Craft-repair        Other               0.000000            0.700000
Exec-managerial     Divorced            0.041096            0.447761
Exec-managerial     Married-civ-spouse  0.501348            0.090446
Exec-managerial     Never-married       0.007463            0.559322
Exec-managerial     Other               0.056818            0.482759
Machine-op-inspct   Divorced            0.000000            0.666667
Machine-op-inspct   Married-civ-spouse  0.016393            0.696429
Machine-op-inspct   Never-married       0.000000            0.600000
Machine-op-inspct   Other               0.000000            0.500000
Other               Divorced            0.000000            0.736842
Other               Married-civ-spouse  0.045640            0.613208
Other               Never-married       0.000734            0.565217
Other               Other               0.003268            0.692308
Other-service       Divorced            0.000000            0.555556
Other-service       Married-civ-spouse  0.003195            0.673077
Other-service       Never-married       0.000000            0.600000
Other-service       Other               0.000000            1.000000
Prof-specialty      Divorced            0.014778            0.584906
Prof-specialty      Married-civ-spouse  0.611650            0.069612
Prof-specialty      Never-married       0.011494            0.623656
Prof-specialty      Other               0.027397            0.500000
Sales               Divorced            0.000000            0.625000
Sales               Married-civ-spouse  0.213759            0.315914
Sales               Never-married       0.003120            0.593750
Sales               Other               0.000000            0.714286
print_false_rates_table(y_pred_xgb,countries, marital_statuses, 'native-country_ ', 'marital-status_ ')
native-country      marital-status      False Positive Rate False Negative Rate
Other               Divorced            0.000000            0.600000
Other               Married-civ-spouse  0.104000            0.350746
Other               Never-married       0.000000            0.785714
Other               Other               0.006135            0.666667
United-States       Divorced            0.007555            0.559585
United-States       Married-civ-spouse  0.153697            0.325066
United-States       Never-married       0.002560            0.593458
United-States       Other               0.007092            0.576087
print(sum(y_train))
7841
It can be concluded from the tables above that the model has a significantly higher False Negative Rate than False Positive Rate, with most values being over 50%. Upon checking the positive results of the target variable in the dataset, it appears that there is an imbalance in the collection of the target variable, which may be a cause for the high False Negative Rate.
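The imbalance can be quantified directly from the counts. A minimal sketch, using the positive count printed above and assuming roughly 30,162 training rows after dropna; the resulting neg/pos ratio is also the conventional value for XGBoost's scale_pos_weight parameter, one sensible mitigation:

```python
# Class balance check: 7841 positives is the count printed above; the total
# row count after dropna (~30162) is an assumption for illustration.
n_pos = 7841
n_total = 30162
n_neg = n_total - n_pos
pos_share = n_pos / n_total
# XGBoost's scale_pos_weight parameter is commonly set to neg/pos
# to counteract this kind of target imbalance
scale_pos_weight = n_neg / n_pos
print(f"positive share: {pos_share:.1%}")
print(f"suggested scale_pos_weight: {scale_pos_weight:.2f}")
```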
plot_combo_error_rate(y_pred_xgb,sexes, countries, 'sex_ ', 'native-country_ ')
plot_combo_error_rate(y_pred_xgb,races, countries, 'race_ ', 'native-country_ ')
plot_combo_error_rate(y_pred_xgb,marital_statuses, countries, 'marital-status_ ', 'native-country_ ')
plot_combo_error_rate(y_pred_xgb,occupations, countries, 'occupation_ ', 'native-country_ ')
Although the countries data are not well-balanced, no specific bias is observed in this case.
plot_combo_error_rate(y_pred_xgb,sexes, occupations, 'sex_ ', 'occupation_ ')
plot_combo_error_rate(y_pred_xgb,races, occupations, 'race_ ', 'occupation_ ')
plot_combo_error_rate(y_pred_xgb,marital_statuses, occupations, 'marital-status_ ', 'occupation_ ')
plot_combo_error_rate(y_pred_xgb,countries, occupations, 'native-country_ ', 'occupation_ ')
Given that the occupation has the most significant impact on the prediction, no particular bias is observed in this case.
plot_combo_error_rate(y_pred_large_dt,races, sexes, 'race_ ', 'sex_ ')
plot_combo_error_rate(y_pred_large_dt,marital_statuses, sexes, 'marital-status_ ', 'sex_ ')
plot_combo_error_rate(y_pred_large_dt,occupations, sexes, 'occupation_ ', 'sex_ ')
plot_combo_error_rate(y_pred_large_dt,countries, sexes, 'native-country_ ', 'sex_ ')
plot_combo_error_rate(y_pred_large_dt,sexes, races, 'sex_ ', 'race_ ')
plot_combo_error_rate(y_pred_large_dt,marital_statuses, races, 'marital-status_ ', 'race_ ')
plot_combo_error_rate(y_pred_large_dt,occupations, races, 'occupation_ ', 'race_ ')
plot_combo_error_rate(y_pred_large_dt,countries, races, 'native-country_ ', 'race_ ')
plot_combo_error_rate(y_pred_large_dt,sexes, marital_statuses, 'sex_ ', 'marital-status_ ')
plot_combo_error_rate(y_pred_large_dt,races, marital_statuses, 'race_ ', 'marital-status_ ')
plot_combo_error_rate(y_pred_large_dt,occupations, marital_statuses, 'occupation_ ', 'marital-status_ ')
plot_combo_error_rate(y_pred_large_dt,countries, marital_statuses, 'native-country_ ', 'marital-status_ ')
plot_combo_error_rate(y_pred_large_dt,sexes, countries, 'sex_ ', 'native-country_ ')
plot_combo_error_rate(y_pred_large_dt,races, countries, 'race_ ', 'native-country_ ')
plot_combo_error_rate(y_pred_large_dt,marital_statuses, countries, 'marital-status_ ', 'native-country_ ')
plot_combo_error_rate(y_pred_large_dt,occupations, countries, 'occupation_ ', 'native-country_ ')
plot_combo_error_rate(y_pred_large_dt,sexes, occupations, 'sex_ ', 'occupation_ ')
plot_combo_error_rate(y_pred_large_dt,races, occupations, 'race_ ', 'occupation_ ')
plot_combo_error_rate(y_pred_large_dt,marital_statuses, occupations, 'marital-status_ ', 'occupation_ ')
plot_combo_error_rate(y_pred_large_dt,countries, occupations, 'native-country_ ', 'occupation_ ')
plot_combo_error_rate(y_pred_small_dt,races, sexes, 'race_ ', 'sex_ ')
plot_combo_error_rate(y_pred_small_dt,marital_statuses, sexes, 'marital-status_ ', 'sex_ ')
plot_combo_error_rate(y_pred_small_dt,occupations, sexes, 'occupation_ ', 'sex_ ')
plot_combo_error_rate(y_pred_small_dt,countries, sexes, 'native-country_ ', 'sex_ ')
plot_combo_error_rate(y_pred_small_dt,sexes, races, 'sex_ ', 'race_ ')
plot_combo_error_rate(y_pred_small_dt,marital_statuses, races, 'marital-status_ ', 'race_ ')
plot_combo_error_rate(y_pred_small_dt,occupations, races, 'occupation_ ', 'race_ ')
plot_combo_error_rate(y_pred_small_dt,countries, races, 'native-country_ ', 'race_ ')
plot_combo_error_rate(y_pred_small_dt,sexes, marital_statuses, 'sex_ ', 'marital-status_ ')
plot_combo_error_rate(y_pred_small_dt,races, marital_statuses, 'race_ ', 'marital-status_ ')
plot_combo_error_rate(y_pred_small_dt,occupations, marital_statuses, 'occupation_ ', 'marital-status_ ')
plot_combo_error_rate(y_pred_small_dt,countries, marital_statuses, 'native-country_ ', 'marital-status_ ')
plot_combo_error_rate(y_pred_small_dt,sexes, countries, 'sex_ ', 'native-country_ ')
plot_combo_error_rate(y_pred_small_dt,races, countries, 'race_ ', 'native-country_ ')
plot_combo_error_rate(y_pred_small_dt,marital_statuses, countries, 'marital-status_ ', 'native-country_ ')
plot_combo_error_rate(y_pred_small_dt,occupations, countries, 'occupation_ ', 'native-country_ ')
plot_combo_error_rate(y_pred_small_dt,sexes, occupations, 'sex_ ', 'occupation_ ')
plot_combo_error_rate(y_pred_small_dt,races, occupations, 'race_ ', 'occupation_ ')
plot_combo_error_rate(y_pred_small_dt,marital_statuses, occupations, 'marital-status_ ', 'occupation_ ')
plot_combo_error_rate(y_pred_small_dt,countries, occupations, 'native-country_ ', 'occupation_ ')
It can be observed that similar biases appear in all three models, as explained in the error analysis of XGBoost, mainly on the features 'sex_ Male', 'race_ Other', 'race_ White', and 'marital-status_ Married-civ-spouse'. These biases may have several causes, such as imbalance in features, imbalance in targets, or spurious connections drawn by the model between unrelated features.
To mitigate these biases, some sensible solutions can be considered, such as:
Among the three models, trade-offs between bias and overall accuracy are also observed. The XGBoost model has the highest overall accuracy, but also has the most significant bias. On the other hand, the small decision tree model has the lowest overall accuracy, but also has the least significant bias. It is important to carefully balance accuracy and bias when choosing a model. Overall, the small decision tree is the fairest model and the XGBoost is the least fair.
After preprocessing the data, we end up with two types of data in the dataset: continuous integer-valued features (e.g. age, education-num) and one-hot encoded features with binary values (e.g. race, occupation). These two types need to be treated separately in order to perform the what-if analysis.
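As a quick way of telling the two kinds of columns apart, a column can be treated as one-hot when every value it takes is 0 or 1. A minimal sketch on a toy frame (the column names merely mimic the real preprocessed ones):

```python
import pandas as pd

# Toy frame standing in for the preprocessed X_test (names are illustrative only)
df = pd.DataFrame({
    'age': [25, 38, 52],
    'education-num': [9, 13, 10],
    'sex_ Male': [1, 0, 1],
    'race_ White': [1, 1, 0],
})

# A column counts as one-hot if every value it contains is 0 or 1
one_hot_cols = [c for c in df.columns if set(df[c].unique()) <= {0, 1}]
continuous_cols = [c for c in df.columns if c not in one_hot_cols]

print(continuous_cols)  # ['age', 'education-num']
print(one_hot_cols)     # ['sex_ Male', 'race_ White']
```

The same split on the real frame would feed the continuous columns to the additive what-if function and the one-hot columns to the category-swap one.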
# Define a what-if function for continuous features: perturb one feature by a
# fixed amount on a random sample and report the share of predictions that flip.
import random
import numpy as np

def cont_what_if(clf, feature, value_added, samples=1000):
    random_indices = random.sample(range(len(X_test)), samples)
    X_what_if = X_test.iloc[random_indices].copy()
    X_what_if[feature] += value_added
    y_pred = clf.predict(X_what_if)
    y_orig = clf.predict(X_test.iloc[random_indices])
    change = (y_pred != y_orig).mean()
    direction = "decreased" if np.mean(y_pred - y_orig) < 0 else "increased"
    print("Proportion of predictions changed after changing {} by {} for the sample: {} {:.2f}%"
          .format(feature, value_added, direction, change * 100))
We can use this function to see what would happen to one of the continuous features after adding or subtracting a certain value; the output is the percentage of predictions that change. Because the function randomly picks 1000 samples at a time, each continuous feature is run three times to check whether it follows the linear pattern as designed.
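The mechanics of the perturbation can be illustrated end to end on synthetic data. The sketch below stands in for the notebook's setup, assuming a toy decision stump in place of clf_xgb and a fabricated age-only frame in place of X_test:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Fabricated data: a single 'age' feature and an income proxy (older -> 1)
rng = np.random.default_rng(0)
X = pd.DataFrame({'age': rng.integers(18, 65, 200)})
y = (X['age'] > 40).astype(int)

# A depth-1 tree learns a single age threshold
clf = DecisionTreeClassifier(max_depth=1).fit(X, y)

# The what-if perturbation: add 10 years and count flipped predictions
X_shift = X.copy()
X_shift['age'] += 10
changed = (clf.predict(X_shift) != clf.predict(X)).mean()
print(f"{changed:.1%} of predictions flipped after adding 10 to age")
```

Only the rows sitting within 10 years below the learned threshold flip, which is the same locality the real models exhibit below.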
cont_what_if(clf_xgb, 'age', 10, 1000)
cont_what_if(clf_xgb, 'age', -5, 1000)
cont_what_if(clf_xgb, 'age', 15, 1000)
Proportion of predictions changed after changing age by 10 for the sample: increased 3.90%
Proportion of predictions changed after changing age by -5 for the sample: decreased 2.80%
Proportion of predictions changed after changing age by 15 for the sample: increased 4.90%
cont_what_if(clf_small_dt, 'age', 10, 1000)
cont_what_if(clf_small_dt, 'age', -5, 1000)
cont_what_if(clf_small_dt, 'age', 15, 1000)
Proportion of predictions changed after changing age by 10 for the sample: increased 0.10%
Proportion of predictions changed after changing age by -5 for the sample: increased 0.00%
Proportion of predictions changed after changing age by 15 for the sample: increased 0.00%
cont_what_if(clf_large_dt, 'age', 10, 1000)
cont_what_if(clf_large_dt, 'age', -5, 1000)
cont_what_if(clf_large_dt, 'age', 15, 1000)
Proportion of predictions changed after changing age by 10 for the sample: increased 1.40%
Proportion of predictions changed after changing age by -5 for the sample: decreased 2.20%
Proportion of predictions changed after changing age by 15 for the sample: increased 1.80%
By changing the age value, a roughly linear change in predictions can be observed as expected: the predicted income rate increases as age increases and decreases as age decreases. Furthermore, the differing sensitivity across models confirms the earlier interpretation that age is the second most influential feature in XGBoost, fifth in the large decision tree, and absent from the small decision tree; this is reflected in the largest change rate for clf_xgb and essentially no change for clf_small_dt.
Given that neither decision tree covers all features, the what-if analysis in the following parts is run on clf_xgb.
cont_what_if(clf_xgb, 'education-num', 2, 1000)
cont_what_if(clf_xgb, 'education-num', 3, 1000)
cont_what_if(clf_xgb, 'education-num', -2, 1000)
Proportion of predictions changed after changing education-num by 2 for the sample: increased 6.40%
Proportion of predictions changed after changing education-num by 3 for the sample: increased 10.70%
Proportion of predictions changed after changing education-num by -2 for the sample: decreased 6.70%
cont_what_if(clf_xgb, 'hours-per-week', 5, 1000)
cont_what_if(clf_xgb, 'hours-per-week', 10, 1000)
cont_what_if(clf_xgb, 'hours-per-week', -5, 1000)
Proportion of predictions changed after changing hours-per-week by 5 for the sample: increased 2.50%
Proportion of predictions changed after changing hours-per-week by 10 for the sample: increased 2.20%
Proportion of predictions changed after changing hours-per-week by -5 for the sample: decreased 2.00%
It can be observed that the model treats education-num and hours-per-week roughly linearly, as predicted: the prediction rate increases as these values increase and decreases as they decrease.
cont_what_if(clf_xgb, 'capital-gain', 2000, 1000)
cont_what_if(clf_xgb, 'capital-gain', 4000, 1000)
cont_what_if(clf_xgb, 'capital-gain', 6000, 1000)
Proportion of predictions changed after changing capital-gain by 2000 for the sample: decreased 13.00%
Proportion of predictions changed after changing capital-gain by 4000 for the sample: decreased 14.80%
Proportion of predictions changed after changing capital-gain by 6000 for the sample: increased 29.20%
cont_what_if(clf_xgb, 'capital-loss', 2000, 1000)
cont_what_if(clf_xgb, 'capital-loss', 4000, 1000)
cont_what_if(clf_xgb, 'capital-loss', 6000, 1000)
Proportion of predictions changed after changing capital-loss by 2000 for the sample: decreased 12.70%
Proportion of predictions changed after changing capital-loss by 4000 for the sample: increased 18.60%
Proportion of predictions changed after changing capital-loss by 6000 for the sample: increased 20.90%
Given that most capital-gain and capital-loss values are 0, the model appears to treat a small non-zero capital-gain or capital-loss as a signal of lower income, while a large one pushes predictions toward higher income. This may indicate that the model uses the presence and magnitude of investment gains or losses as a proxy for wealth or financial stability, assuming that individuals with substantial investment activity are more likely to have higher income. However, this assumption may not hold in all cases, and further investigation is needed to fully understand the impact of capital-gain and capital-loss on the model's predictions.
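The premise that most values are zero is a one-line check; a sketch on illustrative values (the real check would apply the same expression to train_df['capital-gain']):

```python
import pandas as pd

# Illustrative values only; on the real train_df the zero share is far higher
cap_gain = pd.Series([0, 0, 0, 15024, 0, 7688, 0, 0])
zero_share = (cap_gain == 0).mean()
print(f"{zero_share:.0%} of capital-gain values are zero")  # 75% ...
```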
cont_what_if(clf_xgb, 'fnlwgt', -2000, 1000)
cont_what_if(clf_xgb, 'fnlwgt', 4000, 1000)
cont_what_if(clf_xgb, 'fnlwgt', 6000, 1000)
Proportion of predictions changed after changing fnlwgt by -2000 for the sample: decreased 0.20%
Proportion of predictions changed after changing fnlwgt by 4000 for the sample: decreased 0.20%
Proportion of predictions changed after changing fnlwgt by 6000 for the sample: increased 0.60%
As expected, fnlwgt has low influence on the model.
# Define a what-if function for one-hot features: take rows where `feature` is
# active, deactivate it, activate `target_feature`, and report prediction flips.
import random
import numpy as np

def bin_what_if(clf, feature_prefix, feature, target_feature, samples=1000):
    X_what_if_before_pred = X_test[X_test[feature_prefix + feature] == 1]
    # Guard against requesting more samples than matching rows exist
    samples = min(samples, len(X_what_if_before_pred))
    random_indices = random.sample(range(len(X_what_if_before_pred)), samples)
    X_what_if = X_what_if_before_pred.iloc[random_indices].copy()
    X_what_if[feature_prefix + target_feature] = 1
    X_what_if[feature_prefix + feature] = 0
    y_pred = clf.predict(X_what_if)
    y_orig = clf.predict(X_what_if_before_pred.iloc[random_indices])
    change = (y_pred != y_orig).mean()
    direction = "decreased" if np.mean(y_pred - y_orig) < 0 else "increased"
    print("Proportion of predictions changed after changing {} with {} for the sample: {} {:.2f}%"
          .format(feature, target_feature, direction, change * 100))
This function randomly selects rows where a particular one-hot feature is active, replaces it with target_feature, and reports the difference in predictions after the change.
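The indicator swap at the heart of the function looks like this on a toy frame (column names are illustrative, mimicking the one-hot columns of the real X_test):

```python
import pandas as pd

# Two toy one-hot columns for the same categorical feature
X = pd.DataFrame({'sex_ Male': [1, 1, 0], 'sex_ Female': [0, 0, 1]})

# Select rows where the source category is active, then swap the indicators --
# exactly the flip bin_what_if performs before re-predicting
X_swap = X[X['sex_ Male'] == 1].copy()
X_swap['sex_ Male'] = 0
X_swap['sex_ Female'] = 1
print(X_swap)
```

Copying before mutating keeps the original rows intact, so predictions on the swapped and unswapped frames can be compared row by row.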
bin_what_if(clf_xgb, 'sex_ ', 'Male', 'Female', 1000)
bin_what_if(clf_xgb, 'sex_ ', 'Female', 'Male', 1000)
Proportion of predictions changed after changing Male with Female for the sample: increased 2.90%
Proportion of predictions changed after changing Female with Male for the sample: decreased 1.60%
It can be observed that females are predicted to be slightly more likely to earn '>50K' than males.
bin_what_if(clf_xgb, 'race_ ', 'Black', 'White', 1000)
bin_what_if(clf_xgb, 'race_ ', 'White', 'Black', 1000)
Proportion of predictions changed after changing Black with White for the sample: increased 0.60%
Proportion of predictions changed after changing White with Black for the sample: decreased 2.50%
It can be observed that race_White is predicted to be slightly more likely to earn '>50K' than race_Black.
bin_what_if(clf_xgb, 'marital-status_ ', 'Married-civ-spouse', 'Divorced', 1000)
bin_what_if(clf_xgb, 'marital-status_ ', 'Divorced', 'Married-civ-spouse', 1000)
bin_what_if(clf_xgb, 'marital-status_ ', 'Married-civ-spouse', 'Never-married', 1000)
bin_what_if(clf_xgb, 'marital-status_ ', 'Never-married', 'Married-civ-spouse', 1000)
Proportion of predictions changed after changing Married-civ-spouse with Divorced for the sample: decreased 29.50%
Proportion of predictions changed after changing Divorced with Married-civ-spouse for the sample: increased 24.80%
Proportion of predictions changed after changing Married-civ-spouse with Never-married for the sample: decreased 25.70%
Proportion of predictions changed after changing Never-married with Married-civ-spouse for the sample: increased 13.00%
It can be observed that Married-civ-spouse is predicted to be much more likely to earn '>50K' than any other marital status. This matches the interpretation conducted earlier, where it ranked as the primary feature.
bin_what_if(clf_xgb, 'native-country_ ', 'United-States', 'Other', 1000)
bin_what_if(clf_xgb, 'native-country_ ', 'Other', 'United-States', 1000)
Proportion of predictions changed after changing United-States with Other for the sample: decreased 2.00%
Proportion of predictions changed after changing Other with United-States for the sample: increased 1.90%
It can be observed that native-country_United-States is predicted to be slightly more likely to earn '>50K' than native-country_Other.
In conclusion, the what-if step is a useful technique for detecting bias in machine learning models. By changing the values of certain features, we can observe how the model's predictions are affected, and this can help us identify which features are most influential and which groups of people may be affected by bias. Through what-if analysis, we can also see the extent and direction of bias in the model and make adjustments to mitigate it. Overall, the what-if technique enables us to gain a deeper understanding of the biases in our models and take steps to ensure that they are fair and accurate.
The UCI Adult dataset is available in the UCI Machine Learning Repository, which is a popular repository for machine learning datasets. The repository is managed by the Department of Information and Computer Sciences at the University of California, Irvine. It offers access to a vast collection of machine learning datasets, including the UCI Adult dataset, and is a useful resource for researchers and professionals who want to utilize machine learning and data analysis techniques.
Model Details:
https://www.educba.com/decision-tree-hyperparameters/
https://pdfs.semanticscholar.org/7b99/2c46800f8913c251259c1d981a7695105c5a.pdf
Intended Use:
Factors:
Metrics:
Evaluation Data:
Training Data:
Quantitative Analyses:
Ethical Considerations:
Caveats and Recommendations: